Abstract—One of the factors that limits the scale, performance,
and sophistication of distributed applications is the difficulty
of concurrently executing them on multiple distributed computing
resources. In part, this is due to a poor understanding of the
general properties and performance of the coupling between
applications and dynamic resources. This paper addresses this
issue by integrating abstractions representing distributed
applications, resources, and execution processes into a
pilot-based middleware. The middleware provides a platform that
can specify distributed applications, execute them on multiple
resources and with different configurations, and is instrumented
to support investigative analysis. We analyzed the execution of
distributed applications using experiments that measure the
benefits of using multiple resources, the late-binding of
scheduling decisions, and the use of backfill scheduling.

Keywords-abstractions; middleware; execution strategies;
distributed systems

I. Introduction

Large-scale science projects as well as the long-tail of science
rely on distributed computing resources. While progress has been
made in the use of individual resources, a major challenge for
high-performance and distributed computing (HPDC) is developing
applications that can execute concurrently on multiple resources.

Resources are heterogeneous in their architectures and
interfaces, are optimized for specific applications, and enforce
tailored usage and fairness policies. In conjunction with
temporal variation of demand, this introduces resource dynamism,
e.g., the time-varying availability, queue time, load, storage
space, and network latency. Executing an application on multiple
heterogeneous and dynamic resources is difficult due to the
complexity of choosing resources and distributing the
application’s tasks over them.

Executing distributed applications on multiple dynamic resources
requires loosely coupling applications to a predetermined set of
resources. This enables applications to respond to changes in the
availability of resources without breaking predetermined
application-resource assignments. Our work investigates how to
integrate information from both applications and resources to
make coupling decisions. In this paper, we study the static
coupling between a description of applications and information
about dynamic resources.

Coupling applications to multiple dynamic resources requires
advances at both the conceptual and implementation level. We
contribute by devising dedicated abstractions, implementing them
in middleware, and evaluating them experimentally to show the
benefits of our methodology. The abstractions represent:
characterization of application tasks (skeletons); resource
capacity and capabilities (bundles); resource management and task
scheduling (pilots); and application execution (execution
strategies). Middleware that includes implementations of these
abstractions enables the execution of distributed applications
across multiple dynamic resources, and the experimental analysis
of the execution process and its performance.

We performed experiments over a year using up to five concurrent
resources belonging to XSEDE and NERSC, involving more than
20,000 runs of different distributed applications for a total of
10 million tasks executed. These experiments measured the
performance of alternative couplings between distributed
applications and multiple resources. Each coupling is described
by an execution strategy, i.e., the set of decisions made to
execute an application. Each strategy is characterized by the
time it takes to distribute and execute the application tasks
across the resources. We found that execution strategies based on
late-binding and backfilling of tasks on at least three resources
offered the best performance. We observed this to be independent
of the combination of resources used, the number of application
tasks, and the distribution of the tasks’ durations.

This work advances traditional reasoning about distributed
execution by performing both hypothesis-driven and
semi-empirical investigations. The abstractions advance
analytical understanding of coupling distributed applications and
resources; the middleware implementation advances the
state-of-the-art of executing specific distributed applications
on multiple dynamic resources; and experiments improve
understanding of the execution process and its performance.

In Section II, we provide a brief summary of related work.
Section III discusses the abstractions we defined alongside the
architecture and resultant capabilities of implementing the
abstractions into an integrated middleware. Section IV discusses
the experimental methodology and design, the results of the
experiments, and their analysis. Section V reviews the
implications and challenges of the presented work, including
those for the future of HPDC.

II. Related Work 

Coupling of distributed applications to multiple resources is a
well-known research problem. For example, the I-Way, Legion,
Globus, and HTCondor frameworks integrated existing tools to run
distributed applications on multiple resources. However, their
successes also exposed their limitations. Deploying mandatory
middleware on all resources and the complexity of porting
applications limited adoption and usability. The abstractions and
middleware presented in this paper avoid these limitations by:
(1) implementing interoperability among resource middleware; (2)
assuming multiple resources with dynamic availability and diverse
capabilities; and (3) understanding and measuring the performance
of alternative couplings between distributed applications and
those resources.

The Execution Strategy, Skeleton, and Bundle abstractions build
upon related work. Bokhari and Fernandez-Baca theoretically
proved the NP-completeness of matching/scheduling components on
distributed systems. Chen and Deelman, and Malawski et al.
present execution strategies modeled as heuristics based on
empirical experimentation. Workflow systems like Kepler, Swift,
and Taverna implement execution management but in the form of a
single point solution, specifically tailored to their execution
models. Work from Foster and Stevens and from Meyer et al. [20]
resembles some of the features of skeletons but is limited to
parallel applications or directed acyclic graphs (DAGs). Bundles
leverage some of the work done in information collection,
resource discovery [23], and resource characterization as it
relates to queue time prediction and its difficulties [25].

III. Abstractions 

Executing a distributed application requires information about
the application requirements, and resource availability and
capabilities. This information is used to choose a suitable set
of resources on which to run the application executable(s) and a
suitable scheduling of the application tasks. When considering
multiple resources and distributed applications, bringing
together application- and resource-level information requires
specific abstractions that have to uniformly and consistently
describe the core properties of distributed applications, those
of computing resources, and those of the execution of the former
on the latter.

A. Application Abstractions 

Most distributed applications that we have observed are of two
types. The first is Many-Task Computing (MTC), where applications
are composed of tasks, which themselves are executables. The
tasks can be thought of as having a simple structure from the
outside: they read files, compute, and write files, though they
can be quite complex internally. They can also be either
sequential or parallel. These MTC applications fall into a small
number of types, specifically bag-of-task, (iterative)
map-reduce, and (iterative) multistage workflow. Interaction
between tasks usually takes the form of files produced by one
task and consumed by another. The tasks can be distributed across
resources, and there is a framework that is responsible for
launching the tasks and moving the files as needed to allow the
work to be done.

The second type is applications composed of distributed elements
that interact in a more complex manner, such as by exchanging
messages while running, possibly as services. The elements of
these applications can have persistent state, while MTC tasks do
not: the outputs of MTC tasks are based solely on their inputs,
and the tasks are basically idempotent.

The majority of distributed science and engineering applications
are MTC, while in business, there is much more of a mix of the
types. Because we are concerned with science and engineering
applications, we currently focus on MTC applications. We
generalize bag-of-task, (iterative) map-reduce, and (iterative)
multistage workflow applications into (iterative) multistage
workflow applications, since bag-of-task applications are
basically single-stage applications and map-reduce applications
are basically two-stage applications.

We abstract these applications because the real applications can
be difficult to obtain and to build, the real input data sets may
be difficult to obtain, and the real applications may be
difficult to arbitrarily scale. Abstract applications can also be
easily shared and are reproducible. In order to abstract these
applications, we use a top-down approach: an application is
composed of a number of stages (which can be iterated in groups),
and each stage has a number of tasks. An application is described
by specifying the number of stages and the number of tasks, input
and output file and task mapping, task length, and file size
inside each stage. Task lengths and file sizes can be statistical
distributions or polynomial functions of other parameters. For
example, input file size can be a normal distribution, task
length can be a linear function of input file size, and output
size can be a binomial function of task runtime.
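
The parameterization described above can be illustrated with a minimal Python sketch. It is not the Application Skeleton tool or its configuration format: the names SkeletonTask and make_stage, and all numeric coefficients, are hypothetical, and "binomial function" is read here as a two-term polynomial.

```python
import random
from dataclasses import dataclass


@dataclass
class SkeletonTask:
    input_size_mb: float   # size of the task's input file
    duration_s: float      # task length
    output_size_mb: float  # size of the task's output file


def make_stage(num_tasks, seed=0):
    """Sample one stage of a hypothetical skeleton application."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(num_tasks):
        # Input file size drawn from a normal distribution (clipped to >= 1 MB).
        input_size = max(1.0, rng.gauss(100.0, 20.0))
        # Task length as a linear function of the input file size.
        duration = 60.0 + 2.0 * input_size
        # Output size as a two-term polynomial of the task runtime (assumed reading).
        output_size = 0.5 + 0.01 * duration
        tasks.append(SkeletonTask(input_size, duration, output_size))
    return tasks
```

A multistage application would then be a list of such stages, with a mapping from the output files of one stage to the input files of the next.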

We call this type of abstract application a “Skeleton
Application” [27], [28], and have built an open source tool
(Application Skeleton [29]) that can create skeletons. Our tool
is implemented as a parser that reads in a configuration file
that specifies a skeleton application, and produces three groups
of outputs: (1) Preparation Scripts: run to produce the
input/output directories and input files for the skeleton
application. (2) Executables: the actual tasks of each
application stage. (3) Skeleton Application: implemented as: (a)
shell commands that can be executed in sequential order on a
single machine, (b) a Pegasus DAG [30] or (c) a Swift script [31]
that can be executed on a local machine or in a distributed
environment, or (d) a JSON structure that can be consumed by
middleware designed to read it.

The application skeleton tool itself can be called from 
the command line or through an API. Our work here uses 
the skeleton API to call the skeleton code. To execute a 
skeleton application, the preparation scripts are run to create 
the initial input data files, then the skeleton application itself 
is run. The task executables produced by the skeleton tool 
copy the input files from the file system to RAM, sleep for 
some amount of time (specified as the runtime), and copy 
the output files from RAM to the file system. 
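
The behavior of a task executable described above can be approximated by the following Python sketch. It is not the code generated by the skeleton tool; the function name run_skeleton_task and its parameters are hypothetical and only mirror the three phases named in the text.

```python
import time
from pathlib import Path


def run_skeleton_task(input_paths, output_path, output_bytes, runtime_s):
    """Mimic a skeleton task: read inputs into memory, sleep, write output."""
    # Copy every input file from the file system into RAM.
    buffers = [Path(p).read_bytes() for p in input_paths]
    # Stand in for the computation by sleeping for the requested runtime.
    time.sleep(runtime_s)
    # Copy an in-memory buffer of the requested size to the file system.
    Path(output_path).write_bytes(bytes(output_bytes))
    # Report how many input bytes were read.
    return sum(len(b) for b in buffers)
```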

We have previously tested the performance accuracy of the
skeleton applications [27]. We profiled three representative
distributed applications (Montage [32], BLAST [33], and
CyberShake-postprocessing [34]), then derived appropriate
skeleton parameters. We showed that the application skeleton tool
produced skeleton applications that correctly captured important
distributed properties of real applications but were much simpler
to define and use. The performance differences between the real
applications and the skeleton applications were -1.3%, 1.5%, and
2.4%, respectively. Fourteen out of fifteen application stages
had differences of under 3%, ten had differences of under 2%, and
four had differences of under 1%.

B. Resource Abstractions 

Very few scientific applications run on dedicated resources owned
by the end user. As a consequence, most of the users of these
applications have to share resources that have dynamic
availability and diverse capabilities. This introduces complexity
for both the application developer and end user. We mitigate this
complexity with a resource abstraction.

Resource allocation for applications can be either static or
dynamic. Allocation is static when users select resources based
on knowledge of capacity, performance, policy, and cost. Often,
this decision is made on an ad hoc basis. Allocation is dynamic
when users, or code running on their behalf, monitor resource
status and adjust resources as needed. Both static and dynamic
resource allocation require resource characterization but,
despite its importance, we observed a lack of systematic
approaches to such a characterization.

We developed an abstraction to bridge applications and 
diverse resources via uniform resource characterizations. 
In this way, we facilitate efficient resource selection by 
distributed applications. Our resource abstraction is called 
“Bundle” to connote the characterization of a collection of 
resources. 

Our implementation of the Bundle abstraction is the resource
bundle, which may be thought of as representing some portion of
system resources. A resource bundle may contain an arbitrary
number of resource categories (e.g., compute, storage) but it
does not “own” the resources. In this way, a resource may be
shared across multiple bundles and users can be provided with a
convenient handle for performing aggregated operations such as
querying and monitoring. Resource bundles are used to enable
applications to make more effective resource allocation
decisions.

A resource bundle has two components: resource representation and
resource interface. The resource representation characterizes
heterogeneous resources with a large degree of uniformity, thus
hiding complexity. Currently, the resource bundle models
resources across three basic categories: compute, network, and
storage. Resource measures that are meaningful across multiple
platforms are identified in each category. For example, the
property “setup time” of a compute resource means queue wait time
on an HPC cluster or virtual machine startup latency on a cloud
[35].

The resource interface exposes information about resource
availability and capabilities via an API. Two query modes are
supported: on-demand and predictive. The on-demand mode offers
real-time measurements while the predictive mode offers forecasts
based on historical measurements of resource utilization instead
of queue waiting time,
which is extremely hard to predict accurately p4| , p5| , p6| . 

The resource interface exposes three types of interface:
querying, monitoring, and discovering. The query interface uses
end-to-end measurements to organize resource information. For
example, the query interface can be used to inquire how long it
would take to transfer a file from one location to a resource and
vice versa. Although file transfer times are difficult to
estimate [37], proper tools [38] are capable of providing
estimates within an order of magnitude, which are still useful.
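
As an illustration of the resource representation and query interface described above, the following Python sketch models a bundle that groups resources without owning them. It is not the Bundle API: ComputeResource, ResourceBundle, and all field and method names are hypothetical, and the transfer estimate is a simple size-over-bandwidth approximation.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeResource:
    name: str
    kind: str              # "hpc" or "cloud" (illustrative labels)
    cores: int
    setup_time_s: float    # queue wait on HPC, VM startup latency on a cloud
    bandwidth_mb_s: float  # measured end-to-end bandwidth to the resource


@dataclass
class ResourceBundle:
    """Groups resources without owning them; one resource may be in many bundles."""
    resources: dict = field(default_factory=dict)

    def add(self, resource):
        self.resources[resource.name] = resource

    def query_setup_time(self, name):
        # Uniform measure: the same call works for HPC and cloud resources.
        return self.resources[name].setup_time_s

    def query_transfer_time(self, name, file_size_mb):
        # Rough estimate only; such estimates are order-of-magnitude at best.
        return file_size_mb / self.resources[name].bandwidth_mb_s
```

Because the bundle only holds references, the same ComputeResource object can be added to several bundles, which matches the non-ownership property described in the text.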

The monitoring interface can be used to inquire about resource
state and to choose system events for which to receive
notification. For example, performance variation within a cluster
can be monitored so that when the average performance has dropped
below a certain threshold for a certain period, subscribers of
such an event will be notified. This may trigger subsequent
scheduling decisions such as adding more resources to the
application.
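
The threshold-based notification described above can be sketched as follows. This is an illustrative simplification rather than the Bundle monitoring interface: the class PerformanceMonitor is hypothetical, and "a certain period" is approximated by a fixed-size window of the most recent samples.

```python
from collections import deque


class PerformanceMonitor:
    """Notify subscribers when the windowed average drops below a threshold."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the latest samples
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def record(self, value):
        self.samples.append(value)
        # Only evaluate once a full window (the "period") has been observed.
        if len(self.samples) == self.samples.maxlen:
            average = sum(self.samples) / len(self.samples)
            if average < self.threshold:
                for callback in self.subscribers:
                    callback(average)
```

A subscriber could, for example, react to the notification by requesting an additional pilot on another resource.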

The discovery interface, which is future work, will let the user
request resources based on abstract requirements so that a
tailored bundle can be created. A language for specifying
resource requirements is being developed. This concept has been
shown to be successful for storage aggregates in the Tiera
project [39], where resource capacities and resource policies are
specified in a compact notation.

C. Dynamic Resource Abstractions 

Pilots generalize the common concept of a resource placeholder. A
pilot is submitted to the scheduler of a resource, and once
active, accepts and executes tasks directly submitted to it. In
this way, the tasks are executed within the time and space
boundaries set by the resource’s scheduler for the pilot, trading
the scheduler overhead for each task with an overhead for a
single pilot.


Pilots have proven very successful at supporting distributed
applications, especially large-scale applications with single- or
low-core tasks. For example, the ATLAS project [40] processes 5
million jobs every week [41] with the Production and Distributed
Analysis (PanDA) system and uses pilots to execute single-task
jobs. The Open Science Grid (OSG) [42] deploys HTCondor and a
pilot system named “Glidein” to make available 700 million CPU
hours a year for applications requiring high throughput of
single-core tasks [43]. Pilots are also used by several
user-facing tools to execute workflows. Swift implements pilots
in a subsystem named “Coasters” [44], FireWorks employs “Rockets”
[45], and Pegasus uses Glidein via providers like “Corral”.

Most existing pilot systems are part of resource-specific
middleware or of a vertical application framework; they have also
not been instrumented so as to provide accurate information about
their internal state and operations. To avoid these limitations,
we utilize and extend RADICAL-Pilot, a pilot system that does not
need to be deployed in the resource middleware and exposes a
well-defined programming interface to user-facing tools.
RADICAL-Pilot uses RADICAL-SAGA [47] (the reference
implementation of the SAGA OGF standard [48]) to submit pilots
and execute tasks on multiple resources. Timers and introspection
tools record each state transition and the state properties of
each RADICAL-Pilot component. These capabilities are needed to
tailor distributed application execution to diverse use cases,
but to the best of our knowledge, they are missing in other pilot
systems. RADICAL-Pilot also adheres to recent advances in
elucidating the pilot paradigm [49].

D. Execution Abstractions 

Coupling applications and resources is a matching process that
depends on information and a set of decisions. Information about
the application requirements and both the resource availability
and capabilities is collected. This information is then
integrated and used to make decisions about the amount and type
of resources that are needed to satisfy the application
requirements, for how long these resources need to be available,
how data should be managed, and how the application should be
executed on the chosen resources.

We use “Execution Strategy” to refer to all the decisions taken
when executing a given application on one or more resources. We
use this set of decisions to describe the process of coupling
applications to resources. Execution Strategy is the abstraction
while the set of choices made for each of its decisions is one of
its realizations. We call these realizations “execution
strategies” or simply “strategies”.

We use the Execution Strategy abstraction to make explicit the
decisions that, traditionally, remain implicit in the coupling of
applications and resources. In the presence of multiple dynamic
resources, individual user knowledge and experience or best
practices are not sufficient to identify alternative ways to
couple an application to resources, and to understand their
performance trade-offs. Once the decisions are made explicit,
they can be integrated into a model and their effects can be
measured empirically.

An Execution Strategy can be thought of as a tree, where each
decision is a vertex and each edge is a dependence relation among
decisions. In the simplest case, there is only one choice (i.e.,
value) for each decision that enables the execution of the
application. When considering multiple resources, the number of
choices, the information needed to make those choices, and the
complexity of the decision process all increase. For example,
resources can be used concurrently; the distribution of tasks
among resources can be uneven; multi-level scheduling may be used
for late binding, or alternative scheduling algorithms may be
available. The dependency among decisions defines their sequence.
For example, the decision to schedule tasks on each resource may
depend on the number of resources that have been chosen.
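
The tree structure described above can be made concrete with a small Python sketch that enumerates the realizations of a toy Execution Strategy. The decisions, their dependencies, and their values (including the "round-robin" scheduler) are hypothetical choices made for illustration; they are not the decision set used by the AIMES middleware.

```python
def enumerate_strategies():
    """Walk a small decision tree in which later decisions depend on earlier ones."""
    strategies = []
    for num_pilots in (1, 3):
        # The binding choices depend on how many pilots were chosen.
        bindings = ("early",) if num_pilots == 1 else ("early", "late")
        for binding in bindings:
            # The scheduler choices depend on the binding.
            if binding == "early":
                schedulers = ("direct",)
            else:
                schedulers = ("backfill", "round-robin")
            for scheduler in schedulers:
                strategies.append(
                    {"pilots": num_pilots, "binding": binding, "scheduler": scheduler}
                )
    return strategies
```

Each returned dictionary is one root-to-leaf path of the tree, i.e., one execution strategy in the sense defined above.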

Once enacted, each execution strategy executes an application,
but different strategies lead to better or worse performance, as
measured using diverse metrics. For example, execution strategies
may differ in terms of time-to-completion (TTC), throughput,
energy consumption, affinity to specific resources, or economic
considerations. Each metric’s relevance depends on the user’s and
application’s qualitative requirements. TTC is one of the most
relevant metrics; we investigated distributed application
execution by testing experimentally whether, how, and why
alternative execution strategies lead to different TTC for the
same application.

Execution Strategies are realized in a software module called
“Execution Manager”. This module derives and enacts an execution
strategy in five steps: (1) information is gathered about an
application via the skeleton API and about resources via the
bundle API; (2) application requirements and resource
availability and capabilities are determined; (3) a set of
suitable resources is chosen to satisfy the application
requirements; (4) a set of suitable pilots is described and then
instantiated on the chosen resources; and (5) the application is
executed on the instantiated pilots.
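
The five steps above can be outlined in a short Python sketch. It is not the Execution Manager: the function derive_and_enact, the dictionary keys, and the selection rule (prefer the resources with the shortest setup time and split cores evenly) are assumptions made for illustration, and the actual submission of pilots and tasks is omitted.

```python
import math


def derive_and_enact(app, resources, max_pilots=3):
    """Derive a toy execution plan from application and resource descriptions."""
    # Steps 1-2: `app` and `resources` stand in for the information gathered
    # via the skeleton and bundle APIs; here we only derive the core count.
    cores_needed = app["num_tasks"] * app["cores_per_task"]
    # Step 3: choose resources, here those with the shortest expected setup time.
    chosen = sorted(resources, key=lambda r: r["setup_time_s"])[:max_pilots]
    # Step 4: describe one pilot per chosen resource, splitting cores evenly.
    cores_per_pilot = math.ceil(cores_needed / len(chosen))
    pilots = [{"resource": r["name"], "cores": cores_per_pilot} for r in chosen]
    # Step 5: submitting pilots and executing tasks is left out of this sketch.
    return {"pilots": pilots, "tasks": app["num_tasks"]}
```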

The choices made in steps 3 and 4 depend on whether an
optimization metric is given to the Execution Manager. For
example, given a distributed application composed of independent
tasks and the TTC metric, the Execution Manager selects a set of
resources to achieve maximal execution concurrency and minimize
the execution time. Similar decisions are made depending on the
amount of data that needs to be staged, the bandwidth available
between resources, the set of data dependences, etc. Note that
this type of optimization uses semi-empirical heuristics. For
example, algorithmic considerations and empirical evidence about
the behavior of pilots and resources need to be known to decide
whether to execute: (1) a bag of 2048 single-core,
loosely-coupled tasks, with a varying number of cores, duration,
and input/output data on a single large pilot; or (2) the same
bag of tasks on three smaller pilots instantiated on different
resources.



Figure 1. High-level architecture of the AIMES middleware. (1) provides 
distributed application description; (2) provides resource availability and 
capabilities; (3) derives execution strategy; (4-6) enacts execution strategy. 

E. Integration 

We implemented the four abstractions (Skeleton Application,
Bundle, Pilot, and Execution Strategy) as Python modules, then
integrated them into the AIMES (Abstractions and Integrated
Middleware for Extreme-Scales) middleware (see Figure 1). This
middleware offers two distinguishing features: self-containment,
meaning no components need to be deployed into the resources, and
self-introspection, meaning that its state model is explicit and
instrumented to produce complete traces of an application
execution.

The AIMES middleware realizes the coupling of a distributed
application to multiple dynamic resources. The coupling is
described via an execution strategy and then enacted via the
pilot and interoperability modules. In this way, the AIMES
middleware integrates the implementations of the presented
abstractions but also the information about application and
resources required by their coupling. Thanks to the
self-introspection, the AIMES middleware can work as an
experimental laboratory: the enactment of the coupling process
can be measured in all its stages to understand the correlations
among the choices of an execution strategy and the performance of
the execution process.

The AIMES middleware exposes capabilities at the application,
resource, and execution management level. We use the skeleton
module to describe a distributed application composed of many
executable tasks by specifying their number, duration,
intercommunication requirements, data dependences, and grouping
into stages. We also use the bundle module to describe a set of
resources. Bundle information includes resource capacity,
configuration, and availability in terms of utilization, queue
state, queue composition, and types of jobs already scheduled for
execution. We integrate the application and resource information
by deriving an execution strategy and enacting it via
RADICAL-Pilot.

The AIMES middleware uses the skeleton API to acquire the
application’s description (Figure 1, step 1). The same is done
for the resources via the bundle API (Figure 1, step 2a/b). The
Execution Manager derives an execution strategy (Figure 1, step
3) and then enacts it (Figure 1, steps 4, 5, 6). Pilots are
described via RADICAL-Pilot (Figure 1, step 4) and scheduled on
the chosen resources via RADICAL-SAGA (Figure 1, step 5). Once
the pilots become active, tasks’ input files are staged on the
resources of the active pilots and then tasks are scheduled and
executed on those pilots (Figure 1, step 6). Tasks are
automatically restarted in case of failure and, once executed,
task output(s) are staged back to the source where the AIMES
middleware is being used. Finally, all pilots are canceled when
all tasks have executed so as not to waste resources.

IV. Experiments 

We used the AIMES middleware to analyze alternative couplings of
applications and multiple dynamic resources, and to measure their
performance. We described the couplings by means of alternative
execution strategies and observed the relation between the
choices of each execution strategy and the dynamic behavior of
the multiple resources. We acquired data over one year, measuring
experiment performance on four XSEDE resources and one NERSC
resource.

A. Methodology and Design 

We compare the performance of our execution strategies by
measuring application TTC: the sum of a set of possibly
overlapping time components. Examples of time components are the
time taken by each task to execute or the time taken for a pilot
waiting in a queue to be instantiated. These components can
overlap as tasks can execute concurrently, and a pilot can be
queued while tasks execute on an active pilot. Differences in TTC
depend on the duration and overlap of its time components: the
shorter the time components and/or the higher their overlap, the
lower the TTC.
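
One way to read "sum of possibly overlapping time components" is as the length of the union of the components' time intervals, so that overlapping time is counted once. The following Python sketch computes TTC under that assumption; the function name ttc and the (start, end) representation are ours, not the middleware's.

```python
def ttc(components):
    """Return the total length of the union of (start, end) intervals."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(components):
        if current_end is None or start > current_end:
            # A new, non-overlapping block begins: close the previous one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # The component overlaps the current block: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total
```

For instance, a pilot queued during [0, 600] followed by execution during [600, 1500] yields a TTC of 1500 seconds, whereas a second pilot whose queuing overlaps the execution on the first adds nothing for the overlapping part.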

We instrumented the AIMES middleware to record every TTC time
component related to middleware overhead, resource dynamism, task
execution, and data staging. We isolated the differences in TTC
arising from the choices in an execution strategy by running
experiments with the same set of tasks and different execution
strategies. In this way, by comparing the time components of the
TTC, we gained insight into whether and how the execution
strategies differ in the duration of time components and/or their
overlap.

We focused our experiments on the time components related to the
resource dynamism. Specifically, we measured and analyzed the
advantage in performance that can be obtained by using multiple
pilots, late-binding, and back-filling scheduling to increase
execution overlapping while minimizing queuing time duration.

We designed four experiments, each measuring the impact of an
execution strategy on 9 skeletons (see Table I). Each skeleton is
a distinct application that belongs to the same application class
(bag-of-task) but differs in size as measured by the number of
tasks. Applications vary in size between 8 and 2048 single-core
tasks, with a task length of either 15 minutes or distributed
following a truncated Gaussian (mean: 15 min.; stdev: 5 min.;
bounds: [1-30 min.]). The execution strategy decisions are: (1)
early or late binding of tasks to pilots; (2) the type of
scheduler used to place tasks on pilots; (3) the number of
pilots; (4) their size; and (5) their walltime.

Table I

Skeleton applications and execution strategies used for the
experiments. Each application task runs on a single core.
Tx = estimated workflow execution time; Tg = estimated total data
staging time; Trp = AIMES middleware overhead.

Exp. | Skeleton Application                     | Execution Strategy
ID   | #Tasks          | Task Duration (min)    | Binding | Scheduler | #Pilots | Pilot Size     | Pilot Walltime
-----+-----------------+------------------------+---------+-----------+---------+----------------+--------------------------
1    | 2^n, n = [3,11] | 15                     | Early   | Direct    | 1       | #Tasks         | Tx + Tg + Trp
2    | 2^n, n = [3,11] | 1-30 (trunc. Gaussian) | Early   | Direct    | 1       | #Tasks         | Tx + Tg + Trp
3    | 2^n, n = [3,11] | 15                     | Late    | Backfill  | 1-3     | #Tasks/#Pilots | (Tx + Tg + Trp) · #Pilots
4    | 2^n, n = [3,11] | 1-30 (trunc. Gaussian) | Late    | Backfill  | 1-3     | #Tasks/#Pilots | (Tx + Tg + Trp) · #Pilots
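
The workload parameters described above (application sizes of 2^n tasks for n in [3, 11], and truncated Gaussian task durations) can be generated with a short Python sketch. The function names are hypothetical and rejection sampling is our choice for truncating the Gaussian; the skeleton tool may sample differently.

```python
import random


def application_sizes():
    """Number of tasks for each of the nine applications: 8, 16, ..., 2048."""
    return [2 ** n for n in range(3, 12)]


def truncated_gaussian_durations(num_tasks, mean=15.0, stdev=5.0,
                                 low=1.0, high=30.0, seed=0):
    """Sample task durations in minutes, rejecting values outside [low, high]."""
    rng = random.Random(seed)
    durations = []
    while len(durations) < num_tasks:
        value = rng.gauss(mean, stdev)
        if low <= value <= high:  # rejection sampling keeps only in-bound values
            durations.append(value)
    return durations
```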

The size of the application and the durations of its tasks were
selected to be consistent with several scientific applications
that could benefit from distributed execution [50]. Furthermore,
the chosen durations are also consistent with about 35% of the
jobs executed on XSEDE every year. XDMoD [51] shows that in 2014,
more than 13 million jobs were executed on XSEDE with durations
between 30s and 30m, 36% of the total XSEDE workload. Between
2010 and 2013, 25% to 55% of the XSEDE workload was within our
experimental parameters.

Because skeletons have been shown to replicate the behavior of
real-life applications like Montage, BLAST, and CyberShake with
minimal error [27], we used Skeleton workloads to enable better
control over the workload parameters and over the exploration of
regions of decision space.

Each experiment combined an execution strategy with one skeleton
and its nine applications. In total the four experiments
consisted of 36 unique applications. Each application was run
many times depending on run-to-run fluctuation. The execution
order of the 36 applications was varied to avoid correlation
between measurements. The applications were also executed at
irregular intervals so as to avoid effects of short-term resource
load patterns. The order in which pilots were submitted to the
resources was randomized to account for differences in submission
time across resources.

Table I shows that we selected a subset of all the possible
combinations between the given application parameters and the
choices available for the given execution strategy. We discarded
the following combinations because they are redundant,
uninformative, or ineffective: early binding and multiple pilots;
late binding and multiple pilots with enough cores to execute all
the tasks concurrently; early/late binding on pilots with the
same walltime; early/late binding with the same schedulers.

In early binding, tasks are bound to the pilots before they
become active. Given more than one pilot, the TTC of
execution strategies with early binding is determined by the
last pilot becoming active and all its tasks being executed.
In late binding, the tasks are instead bound to pilots as soon
as they become active. In the worst case, the TTC of
executions with late binding depends on the time taken to
execute all the tasks on the first available pilot. Given the
same tasks and the same pilots, the TTC of an execution
strategy with early binding is always equal to or greater
than that of one with late binding. They are equal when
multiple pilots become active at the same time; when pilots
become active at different times, the TTC of an execution
strategy with early binding is greater than with late
binding. We therefore perform experiments with early binding
and only one pilot.
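
The comparison above can be illustrated with a small simulation. It is
a sketch under stated assumptions, not the AIMES scheduler: tasks are
single-core and of equal duration, each task goes to the core that
frees up first, early binding splits tasks round-robin, and the pilot
wait times are hypothetical.

```python
import heapq

def run_on_cores(core_free_times, durations):
    """Greedy list scheduling: each task goes to the core that frees up first.
    Returns the time at which the last task finishes."""
    heap = list(core_free_times)
    heapq.heapify(heap)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(heap)
        end = start + d
        finish = max(finish, end)
        heapq.heappush(heap, end)
    return finish

def ttc_early(pilots, durations):
    """pilots: list of (activation_time_s, cores). Tasks are split
    round-robin across pilots before any pilot is active (assumption)."""
    n = len(pilots)
    return max(run_on_cores([active_at] * cores, durations[i::n])
               for i, (active_at, cores) in enumerate(pilots))

def ttc_late(pilots, durations):
    """Tasks are pulled by whichever pilot core is available first."""
    cores = [active_at for active_at, n in pilots for _ in range(n)]
    return run_on_cores(cores, durations)

tasks = [900] * 12                                # twelve 15-minute tasks
staggered = [(100, 2), (900, 2), (2000, 2)]       # hypothetical wait times
simultaneous = [(100, 2), (100, 2), (100, 2)]
```

With the staggered pilots, early binding waits for the slowest pilot to
work through its fixed share, while late binding lets the first pilots
absorb most of the tasks. With simultaneous activation the two TTCs
coincide.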

Late binding with multiple pilots is a special case of early
binding with a single pilot when the pilots have enough
cores to concurrently execute all the tasks. With both
bindings, all the tasks are executed on a single pilot as
soon as it becomes available. The remaining two pilots do not
contribute to the TTC of the application and are an overuse
of resources. Accordingly, we use late binding and pilots with
fewer cores than those required to execute all the given
tasks. As a consequence, some of the application’s tasks are
executed sequentially, requiring more time compared to a
fully concurrent execution. The pilots we use for early and
late binding therefore have different walltimes.
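
As a rough illustration of why the walltimes differ, the sketch below
estimates the minimum execution time from the number of sequential
"generations" of tasks. It assumes single-core tasks of equal duration
and pilots sized at one third of the task count, rounded up; the helper
name and the example numbers are hypothetical.

```python
import math

def min_execution_minutes(n_tasks, active_cores, task_minutes):
    """Tasks beyond the available cores wait for a later generation."""
    generations = math.ceil(n_tasks / active_cores)
    return generations * task_minutes

n_tasks, task_minutes = 1024, 15
early_cores = n_tasks                      # one pilot, fully concurrent
late_cores = math.ceil(n_tasks / 3)        # each of three pilots (assumption)

early = min_execution_minutes(n_tasks, early_cores, task_minutes)
late_one_pilot = min_execution_minutes(n_tasks, late_cores, task_minutes)
late_all_pilots = min_execution_minutes(n_tasks, 3 * late_cores, task_minutes)
```

In this model a late-binding pilot needs a walltime that covers up to
three generations of tasks, while the single early-binding pilot only
needs to cover one.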

Experiments with early and late binding on a single pilot
have the same TTC because all the tasks are executed as soon
as the pilot becomes active. As such, we do not perform
experiments with late binding and a single pilot. We also use
a single scheduler for early binding because using different
schedulers would measure the performance of the scheduler
implementation, something not directly related to the problem
of coupling an application to multiple resources of multiple
DCI. Analogously, we do not perform experiments to measure
the performance of different schedulers for late binding.
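
The pruning of the decision space described in this section can be
sketched as a filter over the candidate combinations. The value sets
below are a simplification inferred from the text (one or three pilots,
uniform or Gaussian task durations, one scheduler per binding) and are
not a reproduction of the paper's table of strategies.

```python
from itertools import product

BINDINGS = ("early", "late")
PILOTS = (1, 3)
DURATIONS = ("uniform", "gaussian")
# One scheduler per binding, as in the experiments (direct vs. backfill).
SCHEDULER = {"early": "direct", "late": "backfill"}

def keep(binding, pilots):
    if binding == "early" and pilots > 1:
        return False   # TTC is never better than with late binding
    if binding == "late" and pilots == 1:
        return False   # same TTC as early binding on a single pilot
    return True

strategies = [
    {"binding": b, "pilots": p, "durations": d, "scheduler": SCHEDULER[b]}
    for b, p, d in product(BINDINGS, PILOTS, DURATIONS)
    if keep(b, p)
]
applications_per_skeleton = 9
total_applications = len(strategies) * applications_per_skeleton
```

Under these assumptions the filter leaves four strategies, which with
nine applications each matches the 36 applications reported above.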

B. Results and Analysis 

We analyzed the experimental data comparing the performance
of different execution strategies for the given applications.
We explained the observed performance differences by
isolating the components of the execution process and
measuring how their duration and overlapping contributed to
the execution TTC. Figure 2 shows that for the strategies
listed in Table I, those using late binding and backfill
scheduling (Exp. 3 & 4, green and purple lines) have shorter
TTC, and therefore perform better, than strategies using
early binding and direct scheduling (Exp. 1 & 2, azure and
red lines).

Figure 2. Comparison of TTC for experiments 1-4 shows large variations
of the TTC in experiments 1 and 2 and smooth progression of TTC in
experiments 3 and 4.

Figure 3 shows the three main components of TTC for each
experiment: (1) time setting up the execution, including
waiting for the pilot(s) to become active on the target
resource(s) (Tw, green); (2) time executing all the
application tasks on the available pilot(s) (Tx, purple); and
(3) time staging application data in and out (Ts, azure).
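
Because these components can overlap in time, TTC is generally smaller
than their sum. The sketch below makes this concrete with hypothetical
start and end times in seconds; the interval values are invented for
illustration and are not measurements from the experiments.

```python
def ttc_and_sum(intervals):
    """intervals: dict mapping a component name to (start_s, end_s).
    Returns (TTC, sum of the component durations)."""
    start = min(s for s, _ in intervals.values())
    end = max(e for _, e in intervals.values())
    total = sum(e - s for s, e in intervals.values())
    return end - start, total

# Hypothetical run: a later pilot is still queuing while staging and
# execution have already started on the first pilot.
components = {
    "Tw": (0, 1200),
    "Ts": (1000, 2100),
    "Tx": (1050, 2000),
}
ttc, summed = ttc_and_sum(components)
```

Here the run spans 2100 s while the three durations add up to 3250 s,
so the sum overstates TTC whenever components run concurrently.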

Ts depends on the application specifications and has been
restricted to a small percentage of the overall TTC by
experimental design. Larger amounts of data could make Ts
dominant (Fig. 3), and a set of dedicated experiments would
be required to investigate the differences among strategies
with decisions about, for example, compute/data affinity, the
amount of network bandwidth available between the origin of
the data and the target resource(s), or the number and
location of data replicas. This is the subject of future
work.

Tx also depends on the application specifications. Our task
durations are either fixed at 15 minutes or drawn from a
truncated Gaussian distribution between 1 and 30 minutes. Tw,
however, depends mostly on the resource’s queuing time. This
is determined by the resource load, the length of its queue,
and the policies regulating priorities among jobs and usage
fairness among users. As such, Tw is outside user and
middleware control.

Looking at how Ts, Tx, and Tw contribute to the application
TTC, we note that in Figure 3 Ts is consistent across the
four execution strategies, contributing to the TTC
proportionally to the number of tasks executed. This is by
design as every task requires a single input file of 1 MB and
produces a single output file of 2 KB. As a consequence, the
more tasks are executed, the more time is spent staging
input and output files.
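
The linear growth of the staged data volume follows directly from the
per-task file sizes. A minimal sketch, assuming decimal units (1 MB =
10^6 bytes, 1 KB = 10^3 bytes) and hypothetical task counts:

```python
INPUT_BYTES = 1_000_000   # 1 MB input file per task (decimal units assumed)
OUTPUT_BYTES = 2_000      # 2 KB output file per task

def staged_bytes(n_tasks):
    """Total bytes staged in and out for an application of n_tasks tasks."""
    return n_tasks * (INPUT_BYTES + OUTPUT_BYTES)

for n in (8, 256, 2048):  # hypothetical application sizes
    print(n, "tasks:", staged_bytes(n) / 1e6, "MB staged")
```

Doubling the number of tasks doubles the data to stage, which is
consistent with Ts growing proportionally to the application size.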


Tx varies mostly over bindings, with the late binding 
strategies requiring roughly 1/3 more time on average to 
execute tasks than those using early binding. This is also 
by design as the late binding strategies use up to 3 pilots, 
each one with 1/3 of the cores used by the pilot of the early 
binding strategies. As expected, the contribution of Tx to the 
TTC is proportional to the number of tasks executed but it 
increases with a steeper gradient above 256 tasks due to the 
overheads introduced by the AIMES middleware. 

Tw is the TTC component with the most variation among
experiments and the largest contribution to TTC. The
variation is more pronounced for early binding (between 600
and 8,600 seconds) and less for late binding (between 99 and
2,800 seconds). In early binding, Tw varies widely with the
number of tasks, and between uniform and Gaussian
distributions of task durations. In late binding, Tw
increases smoothly with the number of tasks and shows almost
no variation between uniform and Gaussian distributions of
task durations.

Figure 3 also shows the dominance of the Tw contribution to
TTC. In the early binding strategies (Figures 3(a) and 3(b)),
the red line (TTC) and the green line (Tw) have the same
shape, with the variations in TTC determined by equivalent
variations in Tw. Tw also dominates the TTC of the late
binding strategies. Figures 3(c) and 3(d) show how Tw
accounts for almost all the TTC for runs with up to 256
tasks, and most of the TTC for the remaining numbers of
tasks.

The difference in performance across strategies is therefore
due to the duration of Tw. On average, when using three
pilots, the first pilot takes less time to become active than
when using a single pilot. Figures 4(a) and 4(b) offer a
possible explanation of this behavior. The large error bars
of Figure 4(a) show the variability of Tw for the same job
submitted multiple times to the same resource with early
binding. The variability between uniform and Gaussian
distributions at 16 and 1024 tasks (Figure 2) is also a
strong indication of how widely TTC for early binding varies.
Figure 4(b) shows instead small error bars across all task
sizes. Due to the large variation in Tw on each single
resource, when using multiple resources, it is more likely
that at least one of the resources has a comparatively short
Tw.
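
The last point can be made quantitative under a simplifying assumption
that the paper does not make: if queue waits on different resources
were independent and each had the same probability of being short, the
chance that at least one is short would grow quickly with the number of
resources. The probability value below is hypothetical.

```python
def prob_at_least_one_short(p_short, n_resources):
    """P(at least one of n independent resources has a short wait),
    where p_short is the per-resource probability of a short wait."""
    return 1.0 - (1.0 - p_short) ** n_resources

p = 0.3  # hypothetical per-resource probability of a short queue wait
for n in (1, 3, 5):
    print(n, "resources:", round(prob_at_least_one_short(p, n), 3))
```

Real queue waits are neither independent nor identically distributed,
so this is only an intuition for why a few pilots on different
resources can already smooth out the variability of Tw.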

It is interesting that this large variability is already
overcome by using three resources (pilots), as shown in
Figure 4(b). Depending on availability, the resources are
chosen from a pool of five resources that are not only
dynamic, as shown by the variability of Tw, but also
heterogeneous in their interfaces, size, utilization
patterns, and possibly job priority and fair usage policies.
This hints at the generality of the observed behavior, but
future experiments with more resources will explore its
relation with resource homogeneity and its statistical
characterization.

Note that in the context of our experiments, early binding
would still be desirable for applications with a duration of
Tx long enough to make the worst-case scenario of Tw
negligible. In this case, applications with early binding
would have better TTC than those with late binding because of
the single pilot’s larger size and therefore the greater
level of concurrent execution it would support for the
application tasks. Both space and time efficiency would be
maintained as all the pilot cores would be utilized and no
walltime would be used on the target resource beyond the
duration of Tx.

Figure 3. TTC and its time constituents presented for each experiment in
Table I as a function of the distributed application size. Tw = pilot
setup and queuing time; Tx = execution time; Ts = input/output files
staging time. During execution, Tw, Tx, and Ts overlap so
TTC < Tw + Tx + Ts.

The analysis points to three main results. First, the use of
execution strategies enables the quantitative comparison
among alternative couplings between application and resources
both in terms of scale and TTC (compare Figures 4(a) and
4(b)). Usually, users couple application and resources on the
basis of conventions, via trial and error, or in the only way
allowed by the middleware they use. This limits the
scalability of applications and the effective utilization of
multiple and dynamic resources.

Figure 4(a) panel: TTC Early Uniform (Exp. 1), 1 pilot.

Second, within the given experimental parameters, the
normalization of the notoriously unpredictable queuing time
[25] on HPC resources is both measured and shown to depend on
distributing the execution of tasks on multiple pilots
instantiated across at least three resources. This
improvement is also shown to be independent of the number of
tasks, the distribution of their durations, or the DCI on
which pilots are instantiated.

Third, the experiments confirm that the AIMES middleware
implementing the abstractions defined in pn| can acquire and
integrate information about the application and resource
layers, and can use it to execute applications over multiple
resources and DCI at scale. While other middleware enable
similar capabilities, the AIMES middleware is lightweight, is
executed on the users’ systems, and has no requirements on
the software stack installed on each target resource.

Together, these results give users and middleware developers
the opportunity to base execution decisions on measurable
performance differences of the coupling between application
and multiple dynamic resources.


Figure 4(b) panel: TTC Late Uniform (Exp. 3), 3 pilots.

Figure 4. Panels (a) and (b) show the TTC for early and late binding.
Differences in the size of the relative errors in (a) and (b) are
consistent with the variance of Tw observed in Figure 3.


V. Conclusions 

This paper offers three main contributions to the issue of
coupling distributed applications to multiple dynamic
resources: (1) abstractions to represent an application’s
tasks, resource availability and capabilities, and the
execution process; (2) the implementation of these
abstractions in middleware designed to work as a virtual
laboratory for distributed application experiments; (3) an
experimental comparison of execution strategies and their
performance.

Together, these contributions have potential for far-reaching
practical and conceptual impact. Between 2005 and 2009, an
attempt was made to use multiple TeraGrid resources to
execute O(1000) parallel and concurrent tasks [52]. The
attempt was ultimately unsuccessful due to the complexity and
challenges of scale. In contrast, we routinely executed
O(1000) parallel and concurrent tasks on multiple resources
thanks to more scalable infrastructure and to the
effectiveness of the abstractions underlying our
implementation. Unlike other middleware, this is achieved
without requiring the installation of any software on the
resources and in compliance with their policies that do not
permit services to be run on the head nodes.

The AIMES middleware is an incubator for technologies to
contribute to the HPDC community. The interoperability layer
of our middleware abstracts the properties of diverse
resources (Beowulf and Cray clusters, HTCondor pools, Unix
workstations) and it has been adopted by the LHC ATLAS
experiment to execute a mix of job sizes on DOE
supercomputers pS]. Analogously, RADICAL-Pilot is being used
as a stand-alone pilot framework for several projects
spanning molecular, bioinformatic, and earth sciences.

The AIMES middleware is also a virtual laboratory. We are
extending the current experiments and starting new ones. We
have added support for distinct DCI worldwide including OSG,
XSEDE, NERSC, LRZ, UKNSS, and FutureGrid/FutureSystems. We
are extending the experiments presented in this paper to up
to 17 resources and generalizing them to investigate
different metrics including throughput. We are also adding
network information to the resource bundle to experiment with
execution strategies for data-intensive applications, and we
have started to experiment with distributed applications
comprised of non-uniform task sizes. This will let us better
understand and mimic current DCI workloads.

In addition to exploring and measuring the dynamism of
resources, we are also planning to investigate the dynamism
of distributed applications. To this end, we have integrated
Swift [3T] with the middleware discussed in this paper. We
have begun experimenting with a greater range of applications
and with ways to decompose Swift workflows to adapt to
resource availability and capabilities. Ultimately, we will
also study dynamic execution where application strategies
change during execution to maintain the coupling between
dynamic workloads and dynamic resources.

The coupling and execution of workflows with dynamic task
provisioning and data dependences will require execution
strategies with a large number of decisions. For example,
decisions will concern the heterogeneity of the size and
duration of pilots, the evenness of their distribution across
resources, the number and preemption of their instantiation
on each resource, replication of data, and data/compute
affinity. All these decisions will have to be evaluated not
only for TTC but also for other metrics involving execution
reliability, resource availability, allocation consumption,
and energy efficiency.